Abstract - Content-based image retrieval (CBIR) of medical 
images is a crucial task that can contribute to a more reliable 
diagnosis if applied to big data. Recent advances in feature 
extraction and classification have enormously improved CBIR 
results for digital images. However, considering the increasing 
accessibility of big data in medical imaging, we are still in need of 
reducing both memory requirements and computational expenses 
of image retrieval systems. This work proposes to exclude the 
features of image blocks that exhibit a low encoding error 
when learned by an n/p/n autoencoder (p < n). We examine 
the histogram of autoencoding errors of image blocks for each 
image class to decide which image regions, or 
roughly what percentage of an image, should be declared 
relevant for the retrieval task. This reduces feature 
dimensionality and speeds up the retrieval process. To validate 
the proposed scheme, we employ local binary patterns (LBP) and 
support vector machines (SVM), both well-established 
approaches in the CBIR research community. We use the IRMA 
dataset with 14,410 x-ray images as test data. The results show 
that the dimensionality of annotated feature vectors can be 
reduced by up to 50%, resulting in speedups greater than 27% 
at the expense of a less than 1% decrease in retrieval accuracy 
when validating the precision and recall of the top 20 hits. 

I. Problem and Motivation 

Searching for similar digital images in a given archive, 
or content-based image retrieval (CBIR), is a necessary but 
difficult task for several reasons. First, detecting 
similarity is a serious challenge: deciding what 
is similar to what is not easy when dealing with visual data at 
the pixel level. Second, measuring similarity is not easy 
either, and depends on how similarity has been quantified. 
Finally, searching in large archives takes time and can 
become infeasible when dealing with big data. 

Detecting and measuring similarities in large medical image 
archives may be a necessary task for diagnostic radiology, 
radiation oncology, cardiology and other clinical fields. Both 
the accuracy of retrieval and the speed of search become 
more significant in medical imaging because human life is at the 
centre of attention. In contrast to non-medical images, the general 
appearance of a medical image may not be of interest; usually 
a certain part of the image, namely a region of interest 
(ROI) such as an organ, a tumour or a specific tissue 
type, is studied. This implies that many (small) regions of the 
image may be irrelevant for a specific retrieval task. 

II. The Idea 

The idea proposed in this paper is to detect irrelevant 
image blocks in each medical image class by analyzing the 
error histogram of an n/p/n autoencoder in order to reduce 
the dimensionality of features for image retrieval. We use an 
autoencoder with p < n to ensure that autoencoding significant 
image blocks is accompanied by a large error, which makes 
the detection of irrelevant blocks easier. Thus, the hypothesis 
of this paper is that the relevance of image blocks is directly 
proportional to the error of an autoencoder whose hidden 
layer is smaller than its input/output layer, assuming that 
images are largely free of noise. 

III. Literature Review 

The literature on content-based image retrieval (CBIR) is 
vast and spans more than 20 years of publications. 
In the following, we briefly review major CBIR works, elaborate 
on visual features for CBIR, and review some related 
works on autoencoders. 

CBIR - Medical imaging devices are producing large 
numbers of images as they have become more sophisticated, 
offering both higher acquisition speeds and higher resolution. 
Performing image search based on visual information, generally 
called content-based image retrieval (CBIR), has 
become increasingly difficult in recent years. On the one hand, this is 
because of diverse challenges that prevent accurate similarity 
detection. On the other hand, efficient processing of big 
data, or more precisely timely analysis of big image data, 
on ordinary computing devices with conventional algorithms 
appears to be a very daunting task. In the early era of 
digital image search, various search methods were investigated, 
although researchers mainly focused on text-based 
search to retrieve images. Several surveys provide an 
overview of the literature, including a complete overview 
of the first decade of research in this field. 
Medical image retrieval systems, as a special sub-field 
of CBIR, have also been reviewed. CBIR systems help 
to retrieve, manage and navigate huge visual data 
archives that are searchable when textual or visual queries are provided. 
Although CBIR systems differ in the methods applied to images 
in order to extract and store features and to measure similarity, 
the basic architectures of the systems are quite similar. Feature 
extraction and indexing or similarity measurement are the two 
main processes of most CBIR systems. 

Feature extraction - The main purpose of extracting 
features is to create high-level descriptions from low-level data 
(pixel values). Recent medical image retrieval systems rely on 
visual features, such as color, shape, texture, and other spatial 


To appear in proceedings of The 5th Intern. Conf. on Image Processing Theory, Tools and Applications (IPTA’15), Nov 10-13, 2015, Orleans, France 


characteristics. Visual features are arranged in three levels: 
low-level features (primitive), mid-level features (logical) and 
high-level features (abstract). Almost all early systems were 
based on low-level features that capture characteristics such as color 
and shape. Currently, however, both mid-level (e.g. sub-image, bagging 
approach) and high-level (semantic) image representations 
are in demand. General visual features are implemented in 
most CBIR systems because of their independence from prior 
information and their computational efficiency [6]. The efficiency 
of a CBIR system depends, among other factors, on the quality of 
the extracted features. If the features do not represent the image 
content adequately, similar images can hardly be retrieved. 

Gray level features - Color is one of the most commonly 
used features in CBIR. Global gray level features 
of the whole image can be formed from the local gray level 
features of each pixel. Building a gray-level 
histogram is a popular method to extract global features from 
medical images; a typical histogram is discretized 
into 256 bins. Independence from changes in resolution 
and from rotations is a particular advantage of the histogram method. 
Additionally, histograms are simple to implement, efficient to compute 
and have low space requirements, which is why they 
are common in CBIR. Nevertheless, the possible 
assignment of similar color intensities to different bins 
and the absence of any spatial information are the main 
disadvantages of using histograms. To overcome these problems, 
partition-based histograms, which retain spatial information by 
splitting the image into multiple partitions and calculating 
local histograms, have been developed. Moreover, 
to address the lack of spatial information in a histogram, the 
color coherence vectors (CCV) method has been proposed. 
It identifies regions of similar gray level in the image 
and counts the number of pixels in each region. The method 
compares the number of pixels in the region with a threshold 
and classifies them as coherent or incoherent. With this method, 
some spatial information is still lost, and the choice 
of the threshold poses a potential problem. Another method 
is the gray level correlogram, which is designed to extract 
both gray-level and spatial information from an image. 
The method processes pixel position, intensity, probability of 
intensity and distance. 
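
The global and partition-based histograms described above can be sketched in a few lines. This is a minimal illustration, assuming 8-bit gray-level images held as NumPy arrays; the function names `gray_histogram` and `partition_histograms` are hypothetical and do not come from any of the reviewed systems.

```python
import numpy as np

def gray_histogram(img, bins=256):
    """Global gray-level histogram of an 8-bit image, normalized to sum to 1."""
    counts = np.bincount(img.ravel(), minlength=bins)[:bins]
    return counts / counts.sum()

def partition_histograms(img, k, bins=256):
    """Concatenate the local histograms of a k x k grid of partitions."""
    h, w = img.shape
    bh, bw = h // k, w // k  # trailing rows/columns are ignored in this sketch
    parts = [gray_histogram(img[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw], bins)
             for r in range(k) for c in range(k)]
    return np.concatenate(parts)

img = np.arange(256, dtype=np.uint8).reshape(16, 16)
global_hist = gray_histogram(img)          # 256 values, no spatial information
local_hist = partition_histograms(img, 2)  # 4 * 256 values, coarse spatial layout
```

Rotating the image by 90 degrees leaves `global_hist` unchanged, which illustrates the rotation independence mentioned above, whereas the partition-based vector generally changes because it encodes where intensities occur.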

Textural features - In medical imaging, textural features are 
among the most essential image features since gray levels alone may 
be incapable of effective object discrimination. In addition, 
texture features may contain crucial information for 
diagnosing a disease. Smoothness, directionality, and randomness 
are some textural properties. There are different types 
of textural feature extraction methods, which are usually of a 
statistical, geometrical or model-based nature. Texture features 
can provide the means to classify and retrieve images 
[18]. Energy, entropy, contrast and homogeneity are some 
typical values used to characterize a texture in an image. 
Statistical methods represent textures by the statistical distribution 
of the image intensity; examples include co-occurrence matrices, 
Tamura features [20], Markov random fields, fractal models, 
and multi-resolution filtering techniques. Furthermore, Local 
Binary Patterns (LBPs) were originally introduced to 
describe the texture of images. LBP is a practical 
method to quantify gray level textures by using patterns 
of local neighborhoods. LBPs have been used in various 
applications, including texture classification, face recognition 
[24], fingerprint identification [25], and automated cell 
phenotype image classification [26]. In [27], LBPs are used to 
characterize medical images, for instance magnetic resonance 
and mammography images. LBP is widely considered 
a state-of-the-art texture descriptor because of its low computational 
complexity and its invariance to changes in resolution. 
Recently, Tizhoosh introduced the concept of "barcodes" for 
image annotation, which uses the Radon transform and may offer a new 
binary approach to texture description [28]. 

Multimodal searching - After the feature extraction stage, 
each visual feature set is usually stored in a vector. Different 
strategies have been developed to use various modalities in 
searching for CBIR, such as constrained hierarchies or classes, 
early fusion and late fusion. In constrained methods, the search 
is restricted to certain hierarchies or classes, whereas 
all images are searched in both early 
and late fusion methods. Using a constrained method speeds 
up the retrieval process: searching within a local 
area (a certain class) or following a hierarchical order takes 
less time than searching the entire dataset. This method 
offers advantages especially for huge datasets, as long as 
the class and hierarchy estimates do not fail. Image 
annotation and classification can be considered a first step 
toward speeding up image retrieval in large databases. There are 
various approaches to image classification. SVM is a popular 
algorithm for reliable and generally fast classification. 
For example, it has recently been proposed to use 
linear SVM methods with a quadratic optimization method 
for CT brain images [34]. SVM has also been combined 
with K-NN classifiers [35] and with boosting [36]. Since 
we usually have to deal with a high-dimensional feature space, 
most indexing methods cannot perform within reasonable 
time. Moreover, storage of these features constitutes another 
challenge for CBIR. Reducing the feature space without losing 
useful information is therefore a crucial step for both image 
annotation and retrieval. 

Autoencoders - Autoencoders are a special type of neural 
network that encodes inputs and decodes them again with minimum error. 
Introduced by Hinton et al. to make backpropagation 
networks work without a teacher, autoencoders provide a very 
sophisticated unsupervised learning scheme. For instance, the 
denoising autoencoder can be trained to reconstruct data 
from one of its corrupted versions [38]. Very deep autoencoders 
can be initialized by learning many layers of features 
on color images [39]. Autoencoders can then map images to 
short binary codes. Autoencoders have also been applied 
to compress mammograms by using image patches instead 
of the entire image. The autoencoder has further been adapted to the 
continuous case and used for seismic waveforms, 
with a demonstration in which 512-point 
waveforms are compressed to 32-element encodings. [42] put the use of deep 
learning and autoencoders in perspective through a detailed and 
comprehensive investigation of representation learning. [43] 
proposes a generalized autoencoder via manifold 
learning and uses it for digit and face image manifolds. 
A general mathematical framework for the study of 
both linear and non-linear autoencoders has also been presented. [45] propose a feature 
ensemble learning method based on sparse autoencoders for 
image classification. To our knowledge, no work has proposed 
using the autoencoding error to quantify the retrieval relevance 
of images in general and of medical images in particular. As 
described in the next section, we train a shallow autoencoder 
and record the error histogram of each class to eliminate image 
blocks that are irrelevant for retrieval. 

IV. Our Approach: Autoencoding Relevance 

Not every pixel and not every region of an image may be 
relevant for a specific retrieval task. This becomes particularly 
significant in medical imaging, where most of the time a 
certain region of interest (ROI) needs to be analyzed (e.g. a 
tumour or an organ). We therefore envision an approach that 
attempts to exclude small patches (blocks) of the image from 
the feature extraction process. This selective reduction of image 
patches must be based on some universal criterion of relevance 
to ensure a generic approach that can be trained for 
different image modalities and specific ROIs. 

Autoencoders with an n/p/n architecture encode n inputs 
into p positions, and then decode the p positions back into n 
outputs. Generally, we may use p < n, in which case the 
autoencoder functions as a compressor that reduces dimensionality. 
Such an autoencoder is basically a shallow neural network 
with some level of error. The error can be reduced if we 
deepen the network, for instance by making it an n/m/k/p/k/m/n 
autoencoder with n > m > k > p. However, this is not what 
we intend to do. We would like to design a shallow network 
to keep the decoding error high. But why? 

Using a shallow network, an n/p/n autoencoder with a 
relatively high error for decoding image blocks, and capturing 
these errors in a histogram matrix H for each class, will enable us to 
locate retrieval-irrelevant image regions. If an image block 
contains complex structures (edges, textures etc.), then the 
encoding error is expected to be high for an n/p/n autoencoder, 
especially when we ensure that p < n. In contrast, if we 
encode blocks of uniform regions with no significant 
gradient change, then we expect a low decoding error. The 
blocks with the lowest decoding errors are therefore the ones 
with the least contribution to accurate retrieval (or so is our 
assumption, to be validated in experimentation). 
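
The expected behaviour (low error for uniform blocks, high error for detailed blocks) can be illustrated with a small stand-in model. The sketch below is not the RBM-based autoencoder used in the experiments; it assumes a purely linear n/p/n autoencoder fitted by PCA, which spans the subspace a linear autoencoder learns under squared error. The block size, the synthetic data and the names `fit_linear_autoencoder` and `reconstruction_error` are all illustrative assumptions.

```python
import numpy as np

def fit_linear_autoencoder(X, p):
    """Fit a linear n/p/n autoencoder via PCA. X has shape (samples, n)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions; keep the first p as the code layer.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:p]

def reconstruction_error(x, mean, W):
    """Mean squared error after encoding to p values and decoding back to n."""
    code = W @ (x - mean)        # encode: n -> p
    x_hat = mean + W.T @ code    # decode: p -> n
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
n, p = 64, 8                                       # flattened 8 x 8 blocks, p < n
uniform = rng.random((200, 1)) * np.ones((1, n))   # synthetic flat blocks
textured = rng.random((200, n))                    # synthetic high-detail blocks
mean, W = fit_linear_autoencoder(np.vstack([uniform, textured]), p)

err_uniform = reconstruction_error(uniform[0], mean, W)
err_textured = reconstruction_error(textured[0], mean, W)
```

With only p = 8 code units, the flat block is reconstructed almost perfectly while the detailed block is not, so `err_uniform` comes out smaller than `err_textured`. Real image blocks are of course less clear-cut than this synthetic data.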

If we divide image I into k × k blocks and desire that 
a fraction d ∈ [0, 1) of the image area be excluded from 
retrieval-oriented feature extraction, then the task is to eliminate as 
many as ⌊d × k × k⌋ blocks by not extracting features from 
them. If we record the autoencoding errors for all blocks of 
a certain image class, then this can be done by ignoring the 
blocks below an error threshold that eliminates ⌊d × k × k⌋ 
blocks. Of course, this assumes that an inter-class method, e.g. 
SVM, has already classified the query image and assigned it 

Algorithm 1 Proposed approach 

- Configuration - 
Set k to divide the image into k × k blocks 
Set the desired reduction rate d ∈ [0, 1) 
Set n/p/n for the autoencoder (p < n) 

- Training - 
Get the number of training images m_max 
Initialize the feature matrix F 
Initialize the class vector c 
Initialize the error histogram H for all classes 
for each i ∈ {1, 2, ..., m_max} do 
    Read the training image I_i and its class c_i 
    c ← c_i 
    for each j ∈ {1, 2, ..., k × k} do 
        B_j ← currentBlock(I_i) 
        f ← extractLBPfeatures(B_j) 
        F ← appendFeatures(F, f) 
        error ← autoEncode(B_j) 
        H(c_i, j) ← H(c_i, j) + error 
    end for 
end for 
[v_1, v_2, ...] ← TrainSVM(F, c) 
Save support vectors v_1, v_2, ... 
Save the error histogram H 

- Testing - 
Read v_1, v_2, ... and H and a new image I_new 
for each j ∈ {1, 2, ..., k × k} do 
    B_j ← currentBlock(I_new) 
    f ← extractLBPfeatures(B_j) 
end for 
c_new ← classifySVM(f) 
f' ← ignoreBlocks(f, H, c_new) 
<I*_1, I*_2, I*_3, ...> ← calculateSimilarity(F, f', H, c_new) 
Show retrieved images <I*_1, I*_2, I*_3, ...> 


to a certain class. The proposed reduction of feature dimensionality 
using autoencoding error analysis is intended to improve 
the intra-class retrieval task. Figure 1 illustrates the idea of 
relevance quantification via the autoencoding error histogram, and 
Algorithm 1 describes the proposed approach. In order to 
implement a complete solution, we use LBP features and SVM 
to classify the images. Future work could investigate 
the use of opposites, as already reported in the literature for learning 
and optimization, in order to see the effect of using a network 
incorporating opposites [46], [47], [48]. 
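
The histogram update and the block elimination step of the proposed approach can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes H is stored as one row of k × k accumulated errors per class, that the number of dropped blocks is the floor of d × k × k, and that ties are broken by block index. The names `accumulate_errors`, `blocks_to_ignore` and `ignore_blocks` are hypothetical.

```python
import math

def accumulate_errors(H, class_id, block_errors):
    """Add one training image's per-block errors to row class_id of H."""
    H[class_id] = [h + e for h, e in zip(H[class_id], block_errors)]
    return H

def blocks_to_ignore(H, class_id, d):
    """Indices of the floor(d * k * k) blocks with the lowest accumulated error."""
    row = H[class_id]
    n_drop = math.floor(d * len(row))          # len(row) equals k * k
    order = sorted(range(len(row)), key=lambda j: row[j])  # ascending error
    return sorted(order[:n_drop])

def ignore_blocks(block_features, drop):
    """Concatenate the feature vectors of all blocks that are not dropped."""
    dropped = set(drop)
    return [v for j, f in enumerate(block_features) if j not in dropped for v in f]

k, d = 4, 0.25
H = [[0.0] * (k * k) for _ in range(2)]        # two classes, 16 blocks each
accumulate_errors(H, 1, [float(e) for e in range(15, -1, -1)])
drop = blocks_to_ignore(H, 1, d)               # the four lowest-error blocks
reduced = ignore_blocks([[1] * 256 for _ in range(k * k)], drop)
```

For a 4 × 4 grid and d = 0.25 this removes four blocks, matching the example in Fig. 1, and shortens the feature vector from 16 × 256 to 12 × 256 entries when each block contributes a 256-bin histogram.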

V. Experiments and Results 

In this section, we first provide information about the 
benchmark data. The error measurement for classification is 
described next. Subsequently, we detail the accuracy measures 
for the retrieval task. The settings for LBP and SVM are 
described afterward. Lastly, the results will be reported. 

Image Dataset - The Image Retrieval in Medical Applications 
(IRMA) 2009 database is a collection of 14,410 x-ray 
images that have been randomly collected from daily routine 
Fig. 1. Schematic illustration of the proposed approach: Image blocks are autoencoded in an n/p/n architecture. The encoding error of each block is 
recorded for each image class to create the error histogram. A desired reduction (e.g. 25% = 4 blocks) can be used to establish a threshold T in order to 
exclude a number of blocks (gray blocks) from feature extraction. 



Fig. 2. Sample images from IRMA dataset. 


work at the Department of Diagnostic Radiology of the RWTH 
Aachen University (Fig. 2). The downscaled images were 
collected from different ages, genders, view positions, and 
pathologies [33]. Each image in the dataset has an IRMA code, 
and 193 classes are defined according 
to the 2008 IRMA codes. The IRMA code comprises four axes 
with three to four positions each: 1) the technical code (T) 
(modality), 2) the directional code (D) (body orientations), 3) 
the anatomical code (A) (body region), and 4) the biological 
code (B) (the biological system examined). The complete 
IRMA code consists of 13 characters, TTTT-DDD-AAA-BBB, 
with each character in {0, ..., 9; a, ..., z}. A total of 12,677 
images are set aside for training. The remaining 1,733 images 
are used as test data. In this project, the IRMA 2009 dataset 
has been used with the specific 2008 IRMA labels (consisting of 
193 classes) for retrieval purposes. The same dataset 
is used with the general 2005 IRMA labels (consisting of 57 
classes) for classification purposes. The 2005 IRMA labels are more 
general than the 2008 IRMA labels because they consist of 
6 characters from the top of the hierarchical classes, TT-D-AA-B. Not 
every image in the 2009 dataset can be coded according 
to the 2005 IRMA coding scheme: a total of 12,631 
images from the training set and 1,639 images from the test set 
have 2005 IRMA codes. For this reason, SVM classification 
is performed on these images only. 
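
The relation between the 13-character code and the shorter 6-character label can be sketched as a simple truncation. This assumes that the short label keeps the leading characters of each axis, which is how we read "from the top of the hierarchical classes"; the official mapping may differ. The function names and the sample code string are illustrative only.

```python
def split_irma_code(code):
    """Split 'TTTT-DDD-AAA-BBB' into its four axes and validate the shape."""
    axes = code.split("-")
    if [len(a) for a in axes] != [4, 3, 3, 3]:
        raise ValueError(f"not a 13-character IRMA code: {code!r}")
    allowed = set("0123456789abcdefghijklmnopqrstuvwxyz")
    if not all(ch in allowed for a in axes for ch in a):
        raise ValueError(f"unexpected character in IRMA code: {code!r}")
    return dict(zip("TDAB", axes))

def to_short_code(code):
    """Keep the leading 2/1/2/1 characters per axis (TT-D-AA-B), an assumption."""
    axes = split_irma_code(code)
    keep = {"T": 2, "D": 1, "A": 2, "B": 1}
    return "-".join(axes[name][:keep[name]] for name in "TDAB")

full = "1121-127-700-500"        # illustrative code string
short = to_short_code(full)      # "11-1-70-5"
```

Because several full codes share the same leading characters, this truncation merges classes, which is consistent with the smaller number of classes (57 instead of 193) reported for the general labels.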

Error Measurement for Classification - The ImageCLEF 
project has defined an error score in order to 
evaluate the classification performance of methods on the IRMA 
dataset [33]. As all images in the IRMA dataset are labelled along 
the independent technical, directional, anatomical and biological 
axes, the error E can be defined as follows: 

$$E = \sum_{i=1}^{n} \frac{1}{b_i} \, \frac{1}{i} \, \delta(I_i, \hat{I}_i)$$ 

where b_i is the number of possible labels at position i and δ 
is the decision function delivering 1 for a wrong label and 0 
for a correct label when the IRMA code of the image I is 
compared with the IRMA code of the image Î at position i. For every 
axis, the maximal possible error is computed and the errors 
are normalized to lie between 0 and 0.25. If all positions in all axes 
are wrong, the error value is 1. 
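
A minimal sketch of this error score is given below. It applies only the formula and the per-axis normalization stated in the text; any further rules of the official ImageCLEF evaluation are not modelled. The label counts b_i are placeholders (36 per position, the size of the character set), since the real values depend on the code hierarchy, and the function names are hypothetical.

```python
def axis_error(true_axis, pred_axis, b):
    """Unnormalized error of one axis: sum of (1/b_i) * (1/i) * delta_i."""
    return sum((1.0 / b[i]) * (1.0 / (i + 1)) * (t != p)
               for i, (t, p) in enumerate(zip(true_axis, pred_axis)))

def irma_error(true_code, pred_code, b_per_axis):
    """Sum of per-axis errors, each scaled so a fully wrong axis costs 0.25."""
    total = 0.0
    for t, p, b in zip(true_code.split("-"), pred_code.split("-"), b_per_axis):
        worst = sum((1.0 / b[i]) * (1.0 / (i + 1)) for i in range(len(t)))
        total += 0.25 * axis_error(t, p, b) / worst
    return total

# Placeholder label counts: 36 possible characters at every position.
b = [[36] * 4, [36] * 3, [36] * 3, [36] * 3]
truth = "1121-127-700-500"                       # illustrative code string
e_same = irma_error(truth, truth, b)             # all positions correct
e_all = irma_error(truth, "2232-238-811-611", b) # all positions wrong
e_first = irma_error(truth, "2121-127-700-500", b)
```

The 1/i factor makes an error at the first position of an axis more costly than an error further down the hierarchy: `e_first` changes a single character, yet it already accounts for nearly half of that axis's maximum of 0.25.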

Accuracy Measurement for Retrieval - Looking at the 
top m retrieved images, the number of correctly retrieved 
images (true positives T_p) and wrongly retrieved images (false 
positives F_p) can be used to calculate the precision P_top m 
of the retrieval: P_top m = T_p / (T_p + F_p). Analogously, using the top 
m retrieved images, the number of correctly retrieved images 
(true positives T_p) and wrongly not-retrieved images (false 
negatives F_n) can be used to calculate the recall R_top m of the 
retrieval: R_top m = T_p / (T_p + F_n). 
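
The two measures can be computed as in the following sketch. It assumes that a retrieved image counts as correct when its class label equals that of the query, and that the number of relevant images in the archive (`n_relevant`) is known; both are assumptions about the evaluation protocol, and the function name is hypothetical.

```python
def precision_recall_at_m(retrieved_labels, query_label, n_relevant, m):
    """Precision and recall over the top m entries of a ranked result list."""
    top = retrieved_labels[:m]
    tp = sum(1 for label in top if label == query_label)  # correctly retrieved
    fp = len(top) - tp                                    # wrongly retrieved
    fn = n_relevant - tp                                  # relevant but not retrieved
    precision = tp / (tp + fp) if top else 0.0
    recall = tp / (tp + fn) if n_relevant else 0.0
    return precision, recall

ranked = ["a", "a", "b", "a", "c"]   # class labels of the ranked results
p4, r4 = precision_recall_at_m(ranked, "a", n_relevant=10, m=4)
```

Here three of the top four hits are correct, so precision is 0.75, while recall is only 0.3 because ten relevant images exist. In general, recall at m cannot exceed m divided by the number of relevant images, so it stays low for large classes even when precision is high.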

LBP and SVM Settings - We extracted local binary 
patterns from 3 × 3 neighbourhoods within each image block and 
converted the binary numbers to decimal numbers to calculate 
a histogram h_LBP. These histograms were used for both 
classification and retrieval. To classify the input image we used 
support vector machines (SVM). The LBP histogram features 
from the IRMA training dataset are used to train the multi-class 
SVM with a radial basis function as its kernel. 
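
The block-wise LBP histogram can be sketched as follows. This is a basic 3 × 3 LBP written with NumPy, not necessarily the exact variant used in the experiments: the neighbour ordering, the `>=` comparison against the centre pixel, the unnormalized 256-bin histogram and the names `lbp_codes` and `block_lbp_features` are assumptions.

```python
import numpy as np

# Clockwise neighbour offsets (dy, dx) around the centre of a 3 x 3 window.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(block):
    """Return the 8-bit LBP code (0..255) of every interior pixel of a block."""
    block = np.asarray(block, dtype=np.int64)
    h, w = block.shape
    centre = block[1:h - 1, 1:w - 1]
    codes = np.zeros(centre.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbour = block[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Each comparison contributes one binary digit of the decimal code.
        codes += (neighbour >= centre).astype(np.int64) << bit
    return codes

def block_lbp_features(image, k):
    """Concatenate the 256-bin LBP histograms of a k x k grid of blocks."""
    image = np.asarray(image)
    bh, bw = image.shape[0] // k, image.shape[1] // k
    hists = []
    for r in range(k):
        for c in range(k):
            block = image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            hists.append(np.bincount(lbp_codes(block).ravel(), minlength=256))
    return np.concatenate(hists)

features = block_lbp_features(np.full((32, 32), 7), k=4)
```

Under these assumptions the feature vector has k × k × 256 entries, so its length grows with the number of blocks, which is what makes dropping blocks an effective way of shortening it.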

Results - We first classified the images with SVM using 
LBP features. We also tried extracting LBP features from the entire 
image, but that led to a significant decrease in classification 
accuracy (≈ 50%). Analogous to global versus local thresholding, 
it appears that calculating LBP histograms for image blocks 
captures the spatial characteristics of the 
image better than extracting only one LBP histogram for the 
entire image. As Table I illustrates, the LBP-SVM approach 
achieves the lowest error score and hence the highest accuracy 
for 4 × 4 blocks (image divided into 16 regions). 
TABLE I 

SVM ACCURACY WITH LBP FEATURES FOR CLASSIFICATION OF 2005 
IRMA IMAGE DATASET CONTAINING 14,270 IMAGES CONSTITUTING 57 
IMAGE CATEGORIES. 

Blocks   Accuracy   Error score 
4 x 4    75.50%     116.77 
5 x 5    74.31%     121.34 
6 x 6    72.23%     133.90 


For the retrieval, we first autoencoded the image blocks 
using the restricted Boltzmann machine (RBM) function. We 
tried different values for p, but as long as p < n was 
maintained, the error levels were not considerably affected. 
The number of iterations for the autoencoder was set to 5, as we 
observed that more iterations did not change our results. We 
used cross correlation to measure the similarity of the feature 
vectors of two images, and we compared the case with no 
reduction (all image blocks considered for feature calculation) 
with different reduction levels (1/8, 1/4 and 1/2, corresponding 
to 12.5%, 25% and 50%, respectively). The precision and 
recall were calculated for the top 10, 20 and 30 retrieved 
images. Retrieval times were recorded as well. Table II 
provides the averages of 100 runs for different settings. 
For finer grid 
structures (6 × 6 blocks), time savings of greater than 27% 
can be achieved when 50% of the image blocks are 
ignored, resulting in a 50% reduction of the feature vector size. 
This is a significant result given that the accompanying 
decrease in accuracy (both precision and recall) is less than 1% 
for the top 20 hits of the retrieval. One should note that a full, 
one-to-one translation of the space savings (namely 50%) into 
computational savings (here 27%) may not be possible because 
of the intrinsic difference between space and time requirements and 
because of the specifics of the algorithmic steps involved in the 
saving and processing tasks. 
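
The similarity search itself can be sketched as below. The text does not state which form of cross correlation was used, so this example assumes the zero-lag normalized cross-correlation (equivalent to the Pearson correlation coefficient) between two equal-length feature vectors; the function names and the toy vectors are illustrative.

```python
import math

def cross_correlation(u, v):
    """Zero-lag normalized cross-correlation of two equal-length vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    denom = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return sum(a * b for a, b in zip(du, dv)) / denom if denom else 0.0

def retrieve(query, archive, m):
    """Indices of the m archive vectors most correlated with the query."""
    scores = [cross_correlation(query, f) for f in archive]
    return sorted(range(len(archive)), key=lambda i: scores[i], reverse=True)[:m]

archive = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
query = [2, 4, 6, 8]
top2 = retrieve(query, archive, m=2)
```

Each comparison touches every entry of both vectors once, so halving the feature vector roughly halves the arithmetic per comparison. Sorting and other fixed costs do not shrink in the same way, which is one reason time savings can lag behind space savings.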

VI. Summary 

Searching for similar images in large medical image 
archives is both necessary and challenging. Whereas we can 
classify the query image in a very short time to assign it to an 
existing image category, the actual retrieval of similar images 
may need more computational resources through more costly 
and one-by-one comparisons. This becomes a serious obstacle 
for medical imaging with emerging big image data availability. 
There are many methods to reduce the dimensionality of 
image classification and retrieval tasks. Autoencoders have 
been investigated in the past with respect to their compression 
capabilities. In this work, we proposed a different approach 
to data reduction. Motivated by the fact that in medical image 
analysis usually a certain region of interest (ROI) is the focus of 
user evaluation, we proposed to exclude some image patches 
(rectangular blocks) from the feature extraction process. This 
reduces both the memory requirements and the computational 
expense of the retrieval task. To decide which image blocks are 
rather irrelevant for the retrieval process, we trained an n/p/n 
autoencoder (p < n) with the image blocks as both input and 
output. We recorded the autoencoding errors in a histogram 
for each image class. This histogram is then thresholded to 
exclude a certain percentage of the image area (which has a low 
autoencoding error and does not contribute to the image retrieval 
task), in terms of the number of image blocks, for each new image. 
Experiments with the IRMA dataset of 14,410 x-ray images 
showed that, accepting a slight decrease in precision and recall 
for the top 20 hits, the space requirements for the annotated 
feature vectors can be cut by 50% while simultaneously 
increasing the speed of retrieval by 27%. 
TABLE II 

THE AVERAGE PRECISION P, RECALL R AND TIME t ARE MEASURED FOR TOP 10, TOP 20 AND TOP 30 RETRIEVAL FOR RANDOMLY SELECTED IMAGE 
FROM 57 CLASSES AND REPEATED 100 TIMES. AS DISTANCE METRIC, CROSS-CORRELATION WAS USED. 

Blocks   Reduction   P_top 10   R_top 10   P_top 20   R_top 20   P_top 30   R_top 30   t (sec) 
4 x 4    0           0.872      0.1807     0.868      0.2527     0.882      0.3214     0.01807 
5 x 5    0           0.874      0.1783     0.873      0.2538     0.883      0.3222     0.02154 
6 x 6    0           0.871      0.1792     0.877      0.2542     0.882      0.3220     0.02620 